Search for: All records

Creators/Authors contains: "Zhang, Yongle"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the publisher's embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Debugging in production cloud systems (or live debugging) is a critical yet challenging task for on-call developers due to the financial impact of cloud service downtime and the inherent complexity of cloud systems. Unfortunately, how debugging is performed, and the unique challenges faced in the production cloud environment, have not been investigated in detail. In this paper, we perform the first fine-grained, observational study of 93 real-world debugging experiences of production cloud failures in 15 widely adopted open-source distributed systems, including distributed storage systems, databases, computing frameworks, message passing systems, and container orchestration systems. We examine each debugging experience with a fine-grained lens and categorize over 1,700 debugging steps across all incidents. Our study provides a detailed picture of how developers perform various diagnosis activities, including failure reproduction, anomaly analysis, program analysis, hypothesis formulation, information collection, and online experiments. Highlights of our study include: (1) Analyses of the taxonomies and distributions of both live debugging activities and the underlying reasons for hypothesis forking, which confirm the presence of expert debugging strategies in production cloud systems and offer insights to guide the training of novice developers and the development of tools that emulate expert behavior. (2) The identification of the primary challenge in anomaly detection (or observability) for end-to-end debugging: the collection of system-specific data (17.1% of data collected). In comparison, nearly all (96%) invariants utilized to detect anomalies are already present in existing monitoring tools. (3) The identification of the importance of online interventions (i.e., in-production experiments that alter system execution) for live debugging - they are performed as frequently as information collection - with an investigation of different types of interventions and their challenges. (4) An examination of novel debugging techniques developers utilized to overcome debugging challenges inherent to or amplified in cloud systems, which offers insights for the development of enhanced debugging tools.
    Free, publicly-accessible full text available November 20, 2025
  2. Global teams frequently consist of language-based subgroups who put together complementary information to achieve common goals. Previous research outlines a two-step work communication flow in these teams. There are team meetings using a required common language (i.e., English); in preparation for those meetings, people have subgroup conversations in their native languages. Work communication at team meetings is often less effective than in subgroup conversations. In the current study, we investigate the idea of leveraging machine translation (MT) to facilitate global team meetings. We hypothesize that exchanging subgroup conversation logs before a team meeting offers contextual information that benefits teamwork at the meeting. MT can translate these logs, which enables comprehension at a low cost. To test our hypothesis, we conducted a between-subjects experiment where twenty quartets of participants performed a personnel selection task. Each quartet included two English native speakers (NS) and two non-native speakers (NNS) whose native language was Mandarin. All participants began the task with subgroup conversations in their native languages, then proceeded to team meetings in English. We manipulated the exchange of subgroup conversation logs prior to team meetings: with MT-mediated exchanges versus without. Analysis of participants' subjective experience, task performance, and depth of discussions as reflected through their conversational moves jointly indicates that team meeting quality improved when there were MT-mediated exchanges of subgroup conversation logs as opposed to no exchanges. We conclude with reflections on when and how MT could be applied to enhance global teamwork across a language barrier. 